nicar.r-journalism.com/2024/
Andrew Ba Tran @abtran
March 9, 2024
nicar.r-journalism.com/2024/ (Follow along here)
Survey: https://bit.ly/3T6LkQh
Intro to Tidycensus and RStudio
Wrangling Census data with Tidyverse functions
Common Census queries
Visualizing Census data (if there’s time)
Annual survey of 3.5 million US households
Covers more specific topics not available in decennial US Census data (e.g. income, education, language, housing characteristics)
Available as 1-year estimates (for geographies of population 65,000 and greater) and 5-year estimates (for geographies down to the block group)
Data delivered as estimates characterized by margins of error
data.census.gov is the main, revamped interactive data portal for browsing and downloading Census datasets, including the ACS
censusreporter.org is a great resource (built by news nerds) and probably a lot of inspiration for the official census website revamp
The US Census Application Programming Interface (API) allows developers to access Census data resources programmatically
R interface to the Decennial Census, American Community Survey, Population Estimates Program, and Public Use Microdata Series APIs
First released in 2017; nearly 500,000 downloads from the Posit CRAN mirror
censusapi by data journalist Hannah Recht
By Kyle Walker
Wrangles Census data internally to return tidyverse-ready format (or traditional wide format if requested)
Automatically downloads and merges Census geometries to data for mapping
Includes tools for handling margins of error in the ACS and working with survey weights in the ACS PUMS
States and counties can be requested by name (no more looking up FIPS codes!)
Script out your process for re usability
R: programming language and software environment for data analysis (and scraping and visualization and so much more)
RStudio: integrated development environment (IDE) for R developed by Posit
<- assignment saves to the environment/memory# hashes, commented out code
To get started, install the packages and files for this class
If you are using an IRE laptop, these packages are already installed for you
tidycensus (and the Census API) can be used without an API key, but you will be limited to 500 queries per day
Power users: visit https://api.census.gov/data/key_signup.html to request a key, then activate the key from the link in your email.
Once activated, use the census_api_key() function to set your key as an environment variable
open 01_tidycensus.R in RStudio
get_acs() functionThe get_acs() function is your portal to access ACS data using tidycensus
The two required arguments are geography and variables. The function defaults to the latest 5-year ACS (Currently 2018-2022)
GEOID, NAME, variable, estimate, and moe# A tibble: 3,222 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 01001 Autauga County, Alabama B25077_001 191800 7996
2 01003 Baldwin County, Alabama B25077_001 266000 6916
3 01005 Barbour County, Alabama B25077_001 102700 11171
4 01007 Bibb County, Alabama B25077_001 120100 13377
5 01009 Blount County, Alabama B25077_001 159800 6189
6 01011 Bullock County, Alabama B25077_001 87700 20560
7 01013 Butler County, Alabama B25077_001 94800 5984
8 01015 Calhoun County, Alabama B25077_001 140500 5181
9 01017 Chambers County, Alabama B25077_001 116900 9814
10 01019 Cherokee County, Alabama B25077_001 158700 8550
# ℹ 3,212 more rows
get_acs() function to the object median_income1-year ACS data are more current, but are only available for geographies of population 65,000 and greater
Access 1-year ACS data with the argument survey = "acs1"; defaults to "acs5"
# A tibble: 646 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 0103076 Auburn city, Alabama B25077_001 335200 22622
2 0107000 Birmingham city, Alabama B25077_001 125500 14964
3 0121184 Dothan city, Alabama B25077_001 190800 8133
4 0135896 Hoover city, Alabama B25077_001 393400 19743
5 0137000 Huntsville city, Alabama B25077_001 294700 16881
6 0150000 Mobile city, Alabama B25077_001 178800 11552
7 0151000 Montgomery city, Alabama B25077_001 155200 10868
8 0177256 Tuscaloosa city, Alabama B25077_001 297600 30475
9 0203000 Anchorage municipality, Alaska B25077_001 367900 10111
10 0404720 Avondale city, Arizona B25077_001 400300 22495
# ℹ 636 more rows
table parameter can be used to obtain all related variables in a “table” at once# A tibble: 54,774 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 01001 Autauga County, Alabama B19001_001 22308 369
2 01001 Autauga County, Alabama B19001_002 990 265
3 01001 Autauga County, Alabama B19001_003 656 187
4 01001 Autauga County, Alabama B19001_004 1026 303
5 01001 Autauga County, Alabama B19001_005 1335 329
6 01001 Autauga County, Alabama B19001_006 741 205
7 01001 Autauga County, Alabama B19001_007 822 218
8 01001 Autauga County, Alabama B19001_008 840 270
9 01001 Autauga County, Alabama B19001_009 921 260
10 01001 Autauga County, Alabama B19001_010 962 279
# ℹ 54,764 more rows
| Geography | Definition | Available by | Available in |
|---|---|---|---|
"us" |
United States | get_acs(), get_decennial() |
|
"region" |
Census region | get_acs(), get_decennial() |
|
"division" |
Census division | get_acs(), get_decennial() |
|
"state" |
State or equivalent | state | get_acs(), get_decennial() |
"county" |
County or equivalent | state, county | get_acs(), get_decennial() |
"county subdivision" |
County subdivision | state, county | get_acs(), get_decennial() |
"tract" |
Census tract | state, county | get_acs(), get_decennial() |
"block group" OR "cbg" |
Census block group | state, county | get_acs(), get_decennial() |
For geographies available below the state level, the state parameter allows you to query data for a specific state
For smaller geographies (Census tracts, block groups), a county argument may also need to be included
tidycensus translates state names and postal abbreviations internally, so you don’t need to remember the FIPS codes!
# A tibble: 737 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 06073000100 Census Tract 1; San Diego County; Calif… B25077_… 1633800 71171
2 06073000201 Census Tract 2.01; San Diego County; Ca… B25077_… 1331000 147432
3 06073000202 Census Tract 2.02; San Diego County; Ca… B25077_… 891100 97240
4 06073000301 Census Tract 3.01; San Diego County; Ca… B25077_… 957500 232555
5 06073000302 Census Tract 3.02; San Diego County; Ca… B25077_… 761700 108681
6 06073000400 Census Tract 4; San Diego County; Calif… B25077_… 799100 94490
7 06073000500 Census Tract 5; San Diego County; Calif… B25077_… 1025000 81768
8 06073000600 Census Tract 6; San Diego County; Calif… B25077_… 727700 92078
9 06073000700 Census Tract 7; San Diego County; Calif… B25077_… 736400 102788
10 06073000800 Census Tract 8; San Diego County; Calif… B25077_… 678400 119751
# ℹ 727 more rows
To search for variables, use the load_variables() function along with a year and dataset
The View() function in RStudio allows for interactive browsing and filtering
Detailed Tables
Data Profile (add "/profile" for variable lookup)
Subject Tables (add "/subject")
Comparison Profile (add "/cprofile")
Supplemental Estimates (use "acsse")
Migration Flows (access with get_flows())
# A tibble: 2,548 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 01 Alabama B01001_001 5074296 NA
2 01 Alabama B01001_002 2461248 6178
3 01 Alabama B01001_003 146169 3134
4 01 Alabama B01001_004 158767 6029
5 01 Alabama B01001_005 164578 5689
6 01 Alabama B01001_006 97834 3029
7 01 Alabama B01001_007 70450 2897
8 01 Alabama B01001_008 42597 4156
9 01 Alabama B01001_009 34623 3440
10 01 Alabama B01001_010 97373 4627
# ℹ 2,538 more rows
# A tibble: 52 × 100
GEOID NAME B01001_001E B01001_001M B01001_002E B01001_002M B01001_003E
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 01 Alabama 5074296 NA 2461248 6178 146169
2 02 Alaska 733583 NA 385667 2351 23043
3 04 Arizona 7359197 NA 3678381 2695 201423
4 05 Arkansas 3045637 NA 1504488 4216 90239
5 06 California 39029342 NA 19536425 6410 1081904
6 08 Colorado 5839926 NA 2960896 4278 154565
7 09 Connecticut 3626205 NA 1776689 2237 91513
8 10 Delaware 1018396 NA 494657 1092 27456
9 11 District o… 671803 NA 319763 733 20038
10 12 Florida 22244823 NA 10953468 6169 563703
# ℹ 42 more rows
# ℹ 93 more variables: B01001_003M <dbl>, B01001_004E <dbl>, B01001_004M <dbl>,
# B01001_005E <dbl>, B01001_005M <dbl>, B01001_006E <dbl>, B01001_006M <dbl>,
# B01001_007E <dbl>, B01001_007M <dbl>, B01001_008E <dbl>, B01001_008M <dbl>,
# B01001_009E <dbl>, B01001_009M <dbl>, B01001_010E <dbl>, B01001_010M <dbl>,
# B01001_011E <dbl>, B01001_011M <dbl>, B01001_012E <dbl>, B01001_012M <dbl>,
# B01001_013E <dbl>, B01001_013M <dbl>, B01001_014E <dbl>, …
Census variables can be hard to remember; using a named vector to request variables will replace the Census IDs with a custom input
In long form, these custom inputs will populate the variable column; in wide form, they will replace the column names
# A tibble: 174 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 06001 Alameda County, California percent_high_school 16.7 0.4
2 06001 Alameda County, California percent_bachelors 28.3 0.3
3 06001 Alameda County, California percent_graduate 21.3 0.3
4 06003 Alpine County, California percent_high_school 25.7 7.5
5 06003 Alpine County, California percent_bachelors 20.6 7.5
6 06003 Alpine County, California percent_graduate 18.7 8.5
7 06005 Amador County, California percent_high_school 30.7 2.2
8 06005 Amador County, California percent_bachelors 13.6 1.8
9 06005 Amador County, California percent_graduate 5.9 1.1
10 06007 Butte County, California percent_high_school 22.3 0.9
# ℹ 164 more rows
Values available in the 5-year ACS may not be available in the corresponding 1-year ACS tables
If available, they will likely have larger margins of error
Your job as an data journalist: balance need for certainty vs. need for recency in estimates
# A tibble: 52 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 01 Alabama B16001_099 5222 1487
2 02 Alaska B16001_099 NA NA
3 04 Arizona B16001_099 28522 3746
4 05 Arkansas B16001_099 NA NA
5 06 California B16001_099 760215 22953
6 08 Colorado B16001_099 9417 2178
7 09 Connecticut B16001_099 8568 2657
8 10 Delaware B16001_099 1112 601
9 11 District of Columbia B16001_099 NA NA
10 12 Florida B16001_099 80209 7025
# ℹ 42 more rows
# A tibble: 52 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 01 Alabama B16001_099 3854 553
2 02 Alaska B16001_099 18520 1414
3 04 Arizona B16001_099 25913 1774
4 05 Arkansas B16001_099 3154 515
5 06 California B16001_099 772833 10558
6 08 Colorado B16001_099 8724 834
7 09 Connecticut B16001_099 8353 969
8 10 Delaware B16001_099 2837 668
9 11 District of Columbia B16001_099 1250 306
10 12 Florida B16001_099 70430 2877
# ℹ 42 more rows
Variables in the Data Profile and Subject Tables can change names over time
The 2022 ACS is the first to include the new Connecticut Planning Regions in the “county” geography
The 2020 1-year ACS was not released (and is not in tidycensus), so your time-series can break if you are using iteration to pull data
Complete count of the US population mandated by Article 1, Sections 2 and 9 in the US Constitution
Directed by the US Census Bureau (US Department of Commerce); conducted every 10 years since 1790
Used for proportional representation / congressional redistricting
Limited set of questions asked about race, ethnicity, age, sex, and housing tenure
The get_decennial() function is used to acquire data from the decennial US Census
The two required arguments are geography and variables for the functions to work; for 2020 Census data, use year = 2020.
# A tibble: 52 × 4
GEOID NAME variable value
<chr> <chr> <chr> <dbl>
1 42 Pennsylvania P1_001N 13002700
2 06 California P1_001N 39538223
3 54 West Virginia P1_001N 1793716
4 49 Utah P1_001N 3271616
5 36 New York P1_001N 20201249
6 11 District of Columbia P1_001N 689545
7 02 Alaska P1_001N 733391
8 12 Florida P1_001N 21538187
9 45 South Carolina P1_001N 5118425
10 38 North Dakota P1_001N 779094
# ℹ 42 more rows
get_decennial() for the 2020 Census for the first time, we see the following messages:Getting data from the 2020 decennial Census
Using the PL 94-171 Redistricting Data summary file
Note: 2020 decennial Census data use differential privacy, a technique that
introduces errors into data to preserve respondent confidentiality.
ℹ Small counts should be interpreted with caution.
ℹ See https://www.census.gov/library/fact-sheets/2021/protecting-the-confidentiality-of-the-2020-census-redistricting-data.html for additional guidance.
This message is displayed once per session.
The Census Bureau is using differential privacy in an attempt to preserve respondent confidentiality in the 2020 Census data, which is required under US Code Title 13
Intentional errors are introduced into data, impacting the accuracy of small area counts (e.g. some blocks with children, but no adults)
Advocates argue that differential privacy is necessary to satisfy Title 13 requirements given modern database reconstruction technologies; critics contend that the method makes data less useful with no tangible privacy benefit
Can you look through the vars table you loaded earlier and import the table that can answer this?
How do you find the “right” variables or Census table ID? I do a couple things: Use CensusReporter.org or I ask the oldest data reporter in the newsroom.
open 02_wrangling_census_data.R in RStudio
The basics to wrangle data
filter() gets rid of rowsmutate() adds columns to the dataframegroup_by() and summarize() will aggregate the data by groupsarrange() will sort the dataselect() will help narrow down columns|># A tibble: 22,554 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 01001 Autauga County, Alabama B03002_001 58761 NA
2 01001 Autauga County, Alabama B03002_003 42635 224
3 01001 Autauga County, Alabama B03002_004 11496 525
4 01001 Autauga County, Alabama B03002_005 59 71
5 01001 Autauga County, Alabama B03002_006 636 213
6 01001 Autauga County, Alabama B03002_007 0 30
7 01001 Autauga County, Alabama B03002_012 1864 NA
8 01003 Baldwin County, Alabama B03002_001 233420 NA
9 01003 Baldwin County, Alabama B03002_003 192161 1020
10 01003 Baldwin County, Alabama B03002_004 19318 730
# ℹ 22,544 more rows
summary_varcounty_diversity <- get_acs(geography = "county",
variables = c("B03002_003", # white alone
"B03002_004", # black alone
"B03002_005", # native american
"B03002_006", # asian alone
"B03002_007", # pi alone
"B03002_012" # hispanic or latino
),
summary_var = "B03002_001", # total population
survey="acs5",
year=2022)# A tibble: 19,332 × 7
GEOID NAME variable estimate moe summary_est summary_moe
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 01001 Autauga County, Alabama B03002_… 42635 224 58761 NA
2 01001 Autauga County, Alabama B03002_… 11496 525 58761 NA
3 01001 Autauga County, Alabama B03002_… 59 71 58761 NA
4 01001 Autauga County, Alabama B03002_… 636 213 58761 NA
5 01001 Autauga County, Alabama B03002_… 0 30 58761 NA
6 01001 Autauga County, Alabama B03002_… 1864 NA 58761 NA
7 01003 Baldwin County, Alabama B03002_… 192161 1020 233420 NA
8 01003 Baldwin County, Alabama B03002_… 19318 730 233420 NA
9 01003 Baldwin County, Alabama B03002_… 512 187 233420 NA
10 01003 Baldwin County, Alabama B03002_… 2046 347 233420 NA
# ℹ 19,322 more rows
mutate() to add a new column to the data frame# A tibble: 19,332 × 7
GEOID NAME variable estimate moe summary_est percent
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 01001 Autauga County, Alabama B03002_003 42635 224 58761 72.6
2 01001 Autauga County, Alabama B03002_004 11496 525 58761 19.6
3 01001 Autauga County, Alabama B03002_005 59 71 58761 0.100
4 01001 Autauga County, Alabama B03002_006 636 213 58761 1.08
5 01001 Autauga County, Alabama B03002_007 0 30 58761 0
6 01001 Autauga County, Alabama B03002_012 1864 NA 58761 3.17
7 01003 Baldwin County, Alabama B03002_003 192161 1020 233420 82.3
8 01003 Baldwin County, Alabama B03002_004 19318 730 233420 8.28
9 01003 Baldwin County, Alabama B03002_005 512 187 233420 0.219
10 01003 Baldwin County, Alabama B03002_006 2046 347 233420 0.877
# ℹ 19,322 more rows
case_when() to refactor values (within mutate()).default is else or if none of the factors match|> are the new pipes, aka “and then”county_diversity_race <- county_diversity |>
mutate(race=case_when(
variable=="B03002_003" ~"White",
variable=="B03002_004" ~"Black",
variable=="B03002_005" ~"Native American",
variable=="B03002_006" ~"Asian",
variable=="B03002_007" ~"Pacific Islander",
variable=="B03002_012" ~"Hispanic",
.default = "Other"
))# A tibble: 19,332 × 6
GEOID NAME estimate summary_est percent race
<chr> <chr> <dbl> <dbl> <dbl> <chr>
1 01001 Autauga County, Alabama 42635 58761 72.6 White
2 01001 Autauga County, Alabama 11496 58761 19.6 Black
3 01001 Autauga County, Alabama 59 58761 0.100 Native American
4 01001 Autauga County, Alabama 636 58761 1.08 Asian
5 01001 Autauga County, Alabama 0 58761 0 Pacific Islander
6 01001 Autauga County, Alabama 1864 58761 3.17 Hispanic
7 01003 Baldwin County, Alabama 192161 233420 82.3 White
8 01003 Baldwin County, Alabama 19318 233420 8.28 Black
9 01003 Baldwin County, Alabama 512 233420 0.219 Native American
10 01003 Baldwin County, Alabama 2046 233420 0.877 Asian
# ℹ 19,322 more rows
group_by() to group up thingssummarize() to do something (usually math) on these groupscounty_diversity_percent <- county_diversity |>
mutate(race=case_when(
variable=="B03002_003" ~"White",
variable=="B03002_004" ~"Black",
variable=="B03002_005" ~"Native American",
variable=="B03002_006" ~"Asian Pacific Islander",
variable=="B03002_007" ~"Asian Pacific Islander",
variable=="B03002_012" ~"Hispanic",
.default = "Other"
)) |>
group_by(GEOID, NAME, race) |>
summarize(estimate=sum(estimate, na.rm=T),
summary_est=mean(summary_est, na.rm=T)) |>
mutate(percent=estimate/summary_est*100)# A tibble: 16,110 × 5
NAME race estimate summary_est percent
<chr> <chr> <dbl> <dbl> <dbl>
1 Autauga County, Alabama Asian Pacific Islander 636 58761 1.08
2 Autauga County, Alabama Black 11496 58761 19.6
3 Autauga County, Alabama Hispanic 1864 58761 3.17
4 Autauga County, Alabama Native American 59 58761 0.100
5 Autauga County, Alabama White 42635 58761 72.6
6 Baldwin County, Alabama Asian Pacific Islander 2068 233420 0.886
7 Baldwin County, Alabama Black 19318 233420 8.28
8 Baldwin County, Alabama Hispanic 11210 233420 4.80
9 Baldwin County, Alabama Native American 512 233420 0.219
10 Baldwin County, Alabama White 192161 233420 82.3
# ℹ 16,100 more rows
arrange() function# A tibble: 16,110 × 4
NAME race estimate percent
<chr> <chr> <dbl> <dbl>
1 Abbeville County, South Carolina Native American 28 0.115
2 Abbeville County, South Carolina Asian Pacific Islander 58 0.238
3 Abbeville County, South Carolina Hispanic 441 1.81
4 Abbeville County, South Carolina Black 6372 26.1
5 Abbeville County, South Carolina White 16658 68.4
6 Acadia Parish, Louisiana Native American 13 0.0225
7 Acadia Parish, Louisiana Asian Pacific Islander 95 0.165
8 Acadia Parish, Louisiana Hispanic 1780 3.09
9 Acadia Parish, Louisiana Black 9511 16.5
10 Acadia Parish, Louisiana White 44413 77.0
# ℹ 16,100 more rows
arrange() functiondesc() function# A tibble: 16,110 × 5
NAME race estimate summary_est percent
<chr> <chr> <dbl> <dbl> <dbl>
1 Abbeville County, South Carolina White 16658 24368 68.4
2 Abbeville County, South Carolina Black 6372 24368 26.1
3 Abbeville County, South Carolina Hispanic 441 24368 1.81
4 Abbeville County, South Carolina Asian Pacific … 58 24368 0.238
5 Abbeville County, South Carolina Native American 28 24368 0.115
6 Acadia Parish, Louisiana White 44413 57674 77.0
7 Acadia Parish, Louisiana Black 9511 57674 16.5
8 Acadia Parish, Louisiana Hispanic 1780 57674 3.09
9 Acadia Parish, Louisiana Asian Pacific … 95 57674 0.165
10 Acadia Parish, Louisiana Native American 13 57674 0.0225
# ℹ 16,100 more rows
Notice there are 16,110 rows…
filter() function# A tibble: 3,222 × 5
NAME race estimate summary_est percent
<chr> <chr> <dbl> <dbl> <dbl>
1 Abbeville County, South Carolina White 16658 24368 68.4
2 Acadia Parish, Louisiana White 44413 57674 77.0
3 Accomack County, Virginia White 19813 33367 59.4
4 Ada County, Idaho White 411348 497494 82.7
5 Adair County, Iowa White 7046 7479 94.2
6 Adair County, Kentucky White 17308 18887 91.6
7 Adair County, Missouri White 22367 25299 88.4
8 Adair County, Oklahoma Native American 8468 19726 42.9
9 Adams County, Colorado White 247045 520149 47.5
10 Adams County, Idaho White 4009 4464 89.8
# ℹ 3,212 more rows
Now there are 3,222 rows.
Which lines up with the county count in the U.S.
slice() functionCopy and paste over the code you worked so hard on and change the geography and add state and county.
sd_tract_diversity <- get_acs(geography = "tract",
state = "California",
county = "San Diego",
variables = c("B03002_003", # white alone
"B03002_004", # black alone
"B03002_005", # native american
"B03002_006", # asian alone
"B03002_007", # pi alone
"B03002_012" # hispanic or latino
),
summary_var = "B03002_001", # total population
survey="acs5",
year=2022)Nothing changes except the names of the data frames
sd_tract_diversity_plurality <- sd_tract_diversity |>
mutate(race=case_when(
variable=="B03002_003" ~"White",
variable=="B03002_004" ~"Black",
variable=="B03002_005" ~"Native American",
variable=="B03002_006" ~"Asian Pacific Islander",
variable=="B03002_007" ~"Asian Pacific Islander",
variable=="B03002_012" ~"Hispanic",
.default = "Other"
)) |>
group_by(GEOID, NAME, race) |>
summarize(estimate=sum(estimate, na.rm=T),
summary_est=mean(summary_est, na.rm=T)) |>
mutate(percent=estimate/summary_est*100) |>
group_by(GEOID, NAME) |>
arrange(GEOID, NAME, desc(percent)) |>
slice(1)# A tibble: 737 × 5
GEOID race estimate summary_est percent
<chr> <chr> <dbl> <dbl> <dbl>
1 06073000100 White 2116 3027 69.9
2 06073000201 White 1926 2294 84.0
3 06073000202 White 2873 3919 73.3
4 06073000301 White 1331 2340 56.9
5 06073000302 White 2182 2934 74.4
6 06073000400 White 1984 3802 52.2
7 06073000500 White 2002 2934 68.2
8 06073000600 White 2044 3144 65.0
9 06073000700 White 2934 4631 63.4
10 06073000800 White 3097 5257 58.9
# ℹ 727 more rows
inner_join() from dplyr# A tibble: 736 × 6
GEOID race estimate summary_est percent total_evictions
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 06073000100 White 2116 3027 69.9 1
2 06073000201 White 1926 2294 84.0 2
3 06073000202 White 2873 3919 73.3 4
4 06073000301 White 1331 2340 56.9 2
5 06073000302 White 2182 2934 74.4 10
6 06073000400 White 1984 3802 52.2 6
7 06073000500 White 2002 2934 68.2 3
8 06073000600 White 2044 3144 65.0 2
9 06073000700 White 2934 4631 63.4 3
10 06073000800 White 3097 5257 58.9 20
# ℹ 726 more rows
Now you can answer which neighborhoods in San Diego had the higher eviction rates.
sd_joined |>
group_by(race) |>
summarize(population=sum(summary_est, na.rm=T),
total_evictions=sum(total_evictions, na.rm=T)) |>
mutate(rate_of_evictions=total_evictions/population*1000) |>
arrange(desc(rate_of_evictions))# A tibble: 4 × 4
race population total_evictions rate_of_evictions
<chr> <dbl> <dbl> <dbl>
1 Black 36 1 27.8
2 Hispanic 1123001 2121 1.89
3 White 1968310 2816 1.43
4 Asian Pacific Islander 198354 212 1.07
open 03_common_census_queries.R in RStudio
Here’s a basic “for loop” which includes setting the limits for the loop to 10.
Manually, this would have looked like print(1) then print(2) then print(3) one by one.
Each loop iterates the i within the established limits (1:10)
But this is a way to run code many times with slight variations to a value or values in the code. It all goes between the { and }.
big_census_data <- tibble() # creates a blank data frame
for (i in 2020:2022) {
median_df <- get_acs( # temporary dataframe
geography = "county",
variables = "B25077_001", # median home values
year = i
) |>
mutate(year = i) # so we can identify which year
big_census_data <- bind_rows(big_census_data, median_df) |>
arrange(GEOID, year)
# appends the temporary dataframe to the permanent one
}# A tibble: 9,664 × 6
GEOID NAME variable estimate moe year
<chr> <chr> <chr> <dbl> <dbl> <int>
1 01001 Autauga County, Alabama B25077_001 161200 6652 2020
2 01001 Autauga County, Alabama B25077_001 164900 6871 2021
3 01001 Autauga County, Alabama B25077_001 191800 7996 2022
4 01003 Baldwin County, Alabama B25077_001 211600 5853 2020
5 01003 Baldwin County, Alabama B25077_001 226600 5984 2021
6 01003 Baldwin County, Alabama B25077_001 266000 6916 2022
7 01005 Barbour County, Alabama B25077_001 86500 9981 2020
8 01005 Barbour County, Alabama B25077_001 89500 11054 2021
9 01005 Barbour County, Alabama B25077_001 102700 11171 2022
10 01007 Bibb County, Alabama B25077_001 96400 13625 2020
# ℹ 9,654 more rows
# A tibble: 3,230 × 4
NAME `2020` `2022` change
<chr> <dbl> <dbl> <dbl>
1 Jackson County, South Dakota 58600 108400 85.0
2 Crockett County, Texas 102200 181300 77.4
3 Real County, Texas 146900 258000 75.6
4 De Baca County, New Mexico 130000 219200 68.6
5 Keya Paha County, Nebraska 65000 109000 67.7
6 Daggett County, Utah 152500 252400 65.5
7 Stewart County, Georgia 44100 72500 64.4
8 Gem County, Idaho 198700 324700 63.4
9 Shackelford County, Texas 110000 178500 62.3
10 Sterling County, Texas 75000 121000 61.3
# ℹ 3,220 more rows
state_names <- c("DC", "MD", "VA") # Get a list of state names or abbreviations
tract_data <- tibble()
for (i in 1:length(state_names)) {
tract_df <- get_acs(
geography = "tract",
variables = "B25077_001",
year = 2022,
state=state_names[i] # Swap out the state name in the array
)
tract_data <- bind_rows(tract_data, tract_df)
}# A tibble: 3,879 × 5
GEOID NAME variable estimate moe
<chr> <chr> <chr> <dbl> <dbl>
1 11001000101 Census Tract 1.01; District of Columbia… B25077_… 635000 368769
2 11001000102 Census Tract 1.02; District of Columbia… B25077_… 1382800 636468
3 11001000201 Census Tract 2.01; District of Columbia… B25077_… NA NA
4 11001000202 Census Tract 2.02; District of Columbia… B25077_… 1385700 124023
5 11001000300 Census Tract 3; District of Columbia; D… B25077_… 1110900 40872
6 11001000400 Census Tract 4; District of Columbia; D… B25077_… 1620800 494698
7 11001000501 Census Tract 5.01; District of Columbia… B25077_… 1131400 180121
8 11001000502 Census Tract 5.02; District of Columbia… B25077_… 1168900 532289
9 11001000600 Census Tract 6; District of Columbia; D… B25077_… 1426300 221155
10 11001000702 Census Tract 7.02; District of Columbia… B25077_… 365400 22326
# ℹ 3,869 more rows
state.namestate.abbstate_names <- c(state.name, "District of Columbia")
state_abbs <- c(state.abb, "DC")
state_df <- data.frame(state_names, state_abbs)
state_df state_names state_abbs
1 Alabama AL
2 Alaska AK
3 Arizona AZ
4 Arkansas AR
5 California CA
6 Colorado CO
7 Connecticut CT
8 Delaware DE
9 Florida FL
10 Georgia GA
11 Hawaii HI
12 Idaho ID
13 Illinois IL
14 Indiana IN
15 Iowa IA
16 Kansas KS
17 Kentucky KY
18 Louisiana LA
19 Maine ME
20 Maryland MD
21 Massachusetts MA
22 Michigan MI
23 Minnesota MN
24 Mississippi MS
25 Missouri MO
26 Montana MT
27 Nebraska NE
28 Nevada NV
29 New Hampshire NH
30 New Jersey NJ
31 New Mexico NM
32 New York NY
33 North Carolina NC
34 North Dakota ND
35 Ohio OH
36 Oklahoma OK
37 Oregon OR
38 Pennsylvania PA
39 Rhode Island RI
40 South Carolina SC
41 South Dakota SD
42 Tennessee TN
43 Texas TX
44 Utah UT
45 Vermont VT
46 Virginia VA
47 Washington WA
48 West Virginia WV
49 Wisconsin WI
50 Wyoming WY
51 District of Columbia DC
03_common_census_queries.Ropen 04_tidycensusviz.R
One of the best features of tidycensus is the argument geometry = TRUE, which gets you the correct Census geometries with no hassle
get_acs() with geometry = TRUE returns a spatial Census dataset containing simple feature geometries;
geometry = TRUE does the hard work for you of acquiring and pre-joining spatial Census dataSimple feature collection with 199 features and 3 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -76.71152 ymin: 39.19721 xmax: -76.52945 ymax: 39.3722
Geodetic CRS: NAD83
First 10 features:
GEOID estimate moe geometry
1 24510151200 99400 51399 MULTIPOLYGON (((-76.66506 3...
2 24510270702 218000 17102 MULTIPOLYGON (((-76.57356 3...
3 24510271802 114800 40020 MULTIPOLYGON (((-76.68352 3...
4 24510272004 210600 76149 MULTIPOLYGON (((-76.69473 3...
5 24510090700 78700 27019 MULTIPOLYGON (((-76.60149 3...
6 24510180300 209900 18838 MULTIPOLYGON (((-76.63817 3...
7 24510270703 229600 35850 MULTIPOLYGON (((-76.56192 3...
8 24510010300 338900 21159 MULTIPOLYGON (((-76.58472 3...
9 24510060300 280700 21590 MULTIPOLYGON (((-76.58773 3...
10 24510260203 152800 11899 MULTIPOLYGON (((-76.55386 3...
zcolcounty_migration <- get_flows(
geography = "county",
county = "Baltimore City",
state = "MD"
)
county_migration# A tibble: 2,664 × 7
GEOID1 GEOID2 FULL1_NAME FULL2_NAME variable estimate moe
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 24510 <NA> Baltimore city, Maryland Africa MOVEDIN 685 194
2 24510 <NA> Baltimore city, Maryland Africa MOVEDOUT NA NA
3 24510 <NA> Baltimore city, Maryland Africa MOVEDNET NA NA
4 24510 <NA> Baltimore city, Maryland Asia MOVEDIN 1466 279
5 24510 <NA> Baltimore city, Maryland Asia MOVEDOUT NA NA
6 24510 <NA> Baltimore city, Maryland Asia MOVEDNET NA NA
7 24510 <NA> Baltimore city, Maryland Central Ameri… MOVEDIN 322 174
8 24510 <NA> Baltimore city, Maryland Central Ameri… MOVEDOUT NA NA
9 24510 <NA> Baltimore city, Maryland Central Ameri… MOVEDNET NA NA
10 24510 <NA> Baltimore city, Maryland Caribbean MOVEDIN 231 94
# ℹ 2,654 more rows
county_map <- get_acs(
geography = "county",
variable = c("Population"="B03002_001"),
geometry = TRUE
)
county_mapSimple feature collection with 3222 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -179.1467 ymin: 17.88328 xmax: 179.7785 ymax: 71.38782
Geodetic CRS: NAD83
First 10 features:
GEOID NAME variable estimate moe
1 01069 Houston County, Alabama Population 107040 NA
2 01023 Choctaw County, Alabama Population 12669 NA
3 01005 Barbour County, Alabama Population 24877 NA
4 01107 Pickens County, Alabama Population 18925 NA
5 01033 Colbert County, Alabama Population 57270 NA
6 04012 La Paz County, Arizona Population 16681 NA
7 04001 Apache County, Arizona Population 66054 NA
8 05081 Little River County, Arkansas Population 12024 NA
9 05121 Randolph County, Arkansas Population 18619 NA
10 06037 Los Angeles County, California Population 9936690 NA
geometry
1 MULTIPOLYGON (((-85.71209 3...
2 MULTIPOLYGON (((-88.47323 3...
3 MULTIPOLYGON (((-85.74803 3...
4 MULTIPOLYGON (((-88.34043 3...
5 MULTIPOLYGON (((-88.13925 3...
6 MULTIPOLYGON (((-114.7312 3...
7 MULTIPOLYGON (((-110.0007 3...
8 MULTIPOLYGON (((-94.48558 3...
9 MULTIPOLYGON (((-91.40687 3...
10 MULTIPOLYGON (((-118.6044 3...